{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create entry points to spark\n",
    "try:\n",
    "    sc.stop()\n",
    "except:\n",
    "    pass\n",
    "from pyspark import SparkContext, SparkConf\n",
    "from pyspark.sql import SparkSession\n",
    "sc=SparkContext()\n",
    "spark = SparkSession(sparkContext=sc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# RDD object"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The class `pyspark.SparkContext` creates a client which connects to a Spark cluster. This client can be used to create an RDD object. There are two methods from this class for directly creating RDD objects:\n",
    "* `parallelize()`\n",
    "* `textFile()`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `parallelize()`\n",
    "\n",
    "`parallelize()` distribute a local **python collection** to form an RDD. Common built-in python collections include `dist`, `list`, `tuple` or `set`.\n",
    "\n",
    "Examples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[1, 2, 3]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# from a list\n",
    "rdd = sc.parallelize([1,2,3])\n",
    "rdd.collect()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['cat', 'dog', 'fish']"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# from a tuple\n",
    "rdd = sc.parallelize(('cat', 'dog', 'fish'))\n",
    "rdd.collect()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('cat', 'dog', 'fish'), ('orange', 'apple')]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# from a list of tuple\n",
    "list_t = [('cat', 'dog', 'fish'), ('orange', 'apple')]\n",
    "rdd = sc.parallelize(list_t)\n",
    "rdd.collect()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['fish', 'dog', 'cat']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# from a set\n",
    "s = {'cat', 'dog', 'fish', 'cat', 'dog', 'dog'}\n",
    "rdd = sc.parallelize(s)\n",
    "rdd.collect()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When it is a `dict`, only the keys are used to form the RDD."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['a', 'b', 'c']"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# from a dict\n",
    "d = {\n",
    "    'a': 100,\n",
    "    'b': 200,\n",
    "    'c': 300\n",
    "}\n",
    "rdd = sc.parallelize(d)\n",
    "rdd.collect()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `textFile()`\n",
    "\n",
    "The `textFile()` function reads a text file and returns it as an **RDD of strings**. Usually, you will need to apply some **map** functions to transform each elements of the RDD to some data structure/type that is suitable for data analysis.\n",
    "\n",
    "**When using `textFile()`, each line of the text file becomes an element in the resulting RDD.**\n",
    "\n",
    "Examples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[',mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb',\n",
       " 'Mazda RX4,21,6,160,110,3.9,2.62,16.46,0,1,4,4',\n",
       " 'Mazda RX4 Wag,21,6,160,110,3.9,2.875,17.02,0,1,4,4',\n",
       " 'Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1',\n",
       " 'Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1']"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# read a csv file\n",
    "rdd = sc.textFile('../../data/mtcars.csv')\n",
    "rdd.take(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Fresh install of XP on new computer. Sweet relief! fuck vista\\t1018769417\\t1.0',\n",
       " 'Well. Now I know where to go when I want my knives. #ChiChevySXSW http://post.ly/RvDl\\t10284216536\\t1.0',\n",
       " '\"Literally six weeks before I can take off \"\"SSC Chair\"\" off my email. Its like the torturous 4th mile before everything stops hurting.\"\\t10298589026\\t1.0',\n",
       " 'Mitsubishi i MiEV - Wikipedia, the free encyclopedia - http://goo.gl/xipe Cutest car ever!\\t109017669432377344\\t1.0',\n",
       " \"'Cheap Eats in SLP' - http://t.co/4w8gRp7\\t109642968603963392\\t1.0\"]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# read a txt file\n",
    "rdd = sc.textFile('../../data/twitter.txt')\n",
    "rdd.take(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
